The Annals of Applied Statistics 
2008, Vol. 2, No. 1, 176-196 
DOI: 10.1214/07-AOAS143 

© Institute of Mathematical Statistics. 2008 

ON REGRESSION ADJUSTMENTS IN EXPERIMENTS WITH 
SEVERAL TREATMENTS 

By David A. Freedman 

University of California, Berkeley 

Regression adjustments are often made to experimental data. 
Since randomization does not justify the models, bias is likely; nor 
are the usual variance calculations to be trusted. Here, we evaluate 
regression adjustments using Neyman's nonparametric model. Pre- 
vious results are generalized, and more intuitive proofs are given. A 
bias term is isolated, and conditions are given for unbiased estimation 
in finite samples. 

1. Introduction. Data from randomized controlled experiments (includ- 
ing clinical trials) are often analyzed using regression models and the like. 
The behavior of the estimates can be calibrated using the nonparametric 
model in Neyman (1923), where each subject has potential responses to sev- 
eral possible treatments. Only one response can be observed, according to 
the subject's assignment; the other potential responses must then remain un- 
observed. Covariates are measured for each subject and may be entered into 
the regression, perhaps with the hope of improving precision by adjusting 
the data to compensate for minor imbalances in the assignment groups. 

As discussed in Freedman (2006, 2007), randomization does not justify the 
regression model, so that bias can be expected, and the usual formulas do not 
give the right variances. Moreover, regression need not improve precision. 
Here, we extend some of those results, with proofs that are more intuitive. 
We study asymptotics, isolate a bias term of order 1/n, and give some special 
conditions under which the multiple regression estimator is unbiased in finite 
samples. 

What is the source of the bias when regression models are applied to 
experimental data? In brief, the regression model assumes linear additive 
effects. Given the assignments, the response is taken to be a linear combination
of treatment dummies and covariates, with an additive random error;
coefficients are assumed to be constant across subjects. The Neyman model
makes no assumptions about linearity and additivity. If we write the ex- 
pected response given the assignments as a linear combination of treatment 
dummies, coefficients will vary across subjects. That is the source of the bias 
(algebraic details are given below). 

To put this more starkly, in the Neyman model, inferences are based on 
the random assignment to the several treatments. Indeed, the only stochastic 
element in the model is the randomization. With regression, inferences are 
made conditional on the assignments. The stochastic element is the error 
term, and the inferences depend on assumptions about that error term. 
Those assumptions are not justified by randomization. The breakdown in 
assumptions explains why regression comes up short when calibrated against 
the Neyman model. 

For simplicity, we consider three treatments and one covariate, the main 
difficulty in handling more variables being the notational overhead. There 
is a finite population of n subjects, indexed by i = 1, ..., n. Defined on this
population are four variables a, b, c, z. The value of a at i is $a_i$, and so forth.
These are fixed real numbers. We consider three possible treatments, A, B, C.
If, for instance, i is assigned to treatment A, we observe the response $a_i$, but
do not observe $b_i$ or $c_i$.

The population averages are the parameters of interest here:

(1)  $\bar a = \frac{1}{n}\sum_{i=1}^{n} a_i, \quad \bar b = \frac{1}{n}\sum_{i=1}^{n} b_i, \quad \bar c = \frac{1}{n}\sum_{i=1}^{n} c_i.$

For example, $\bar a$ is the average response if all subjects are assigned to A.
This could be measured directly, at the expense of losing all information
about $\bar b$ and $\bar c$. To estimate all three parameters, we divide the population
at random into three sets A, B, C, of fixed sizes $n_A$, $n_B$, $n_C$. If $i \in A$, then
i receives treatment A; likewise for B and C. We now have a simple model
for a clinical trial. As a matter of notation, A stands for a random set as
well as a treatment.

Let U, V, W be dummy variables for the sets. For instance, $U_i = 1$ if $i \in A$
and $U_i = 0$ otherwise. In particular, $\sum_i U_i = n_A$, and so forth. Let $\bar x_A$ be the
average of x over A, namely,

(2)  $\bar x_A = \frac{1}{n_A}\sum_{i \in A} x_i.$

Plainly, $\bar a_A = \sum_{i \in A} a_i / n_A$ is an unbiased estimator, called the "ITT
estimator," for $\bar a$. Likewise for B and C. "ITT" stands for intention-to-treat.
The idea, of course, is that the sample average is a good estimator for the
population average. The intention-to-treat principle goes back to Bradford
Hill (1961); for additional discussion, see Freedman (2006). There is at least
one flaw in the notation: $\bar x_A$ is a random variable, being the average of x over
the random set A. By contrast, $n_A$ is a fixed quantity, being the number of
elements in A.

In the Neyman model, the observed response for subject i = 1, ..., n is

(3)  $Y_i = a_i U_i + b_i V_i + c_i W_i,$

because a, b, c code the responses to the treatments. If, for instance, i is
assigned to A, the response is $a_i$. Furthermore, $U_i = 1$ and $V_i = W_i = 0$, so
$Y_i = a_i$. In this circumstance, $b_i$ and $c_i$ would not be observable.

We come now to multiple regression. The variable z is a covariate. It is
observed for every subject, and is unaffected by assignment. Applied workers
often estimate the parameters in (1) by a multiple regression of Y on
U, V, W, z. This is the multiple regression estimator whose properties are to
be studied. The idea seems to be that estimates are improved by adjusting
for random imbalance in assignments.

The standard regression model assumes linear additive effects, so that

(4)  $E(Y_i \mid U, V, W, z) = \beta_1 U_i + \beta_2 V_i + \beta_3 W_i + \beta_4 z_i,$

where $\beta$ is constant across subjects. However, the Neyman model makes
no assumptions about linearity or additivity. As a result, $E(Y_i \mid U, V, W, z)$
is given by the right-hand side of (3), with coefficients that vary across
subjects. The variation in the coefficients contradicts the basic assumption
needed to prove that regression estimates are unbiased [Freedman (2005),
page 43]. The variation in the coefficients is the source of the bias.

Analysts who fit (4) to data from a randomized controlled experiment
seem to think of $\hat\beta_1$ as estimating the effect of treatment A, namely, $\bar a$ in (1).
Likewise, $\hat\beta_3 - \hat\beta_1$ is used to estimate $\bar c - \bar a$, the differential effect of treatment
C versus A. Similar considerations apply to other effects. However, these
estimators suffer from bias and other problems to be explored below.

We turn for a moment to combinatorics. Proposition 1 is a well-known
result. (All proofs are deferred to the Appendix at the end of the article.)

Proposition 1. Let $\tilde p_S = n_S/n$ for S = A, B or C.

(i) $E(\bar x_A) = \bar x$.

(ii) $\mathrm{var}(\bar x_A) = \frac{1}{n-1}\,\frac{1-\tilde p_A}{\tilde p_A}\,\mathrm{var}(x)$.

(iii) $\mathrm{cov}(\bar x_A, \bar y_A) = \frac{1}{n-1}\,\frac{1-\tilde p_A}{\tilde p_A}\,\mathrm{cov}(x, y)$.

(iv) $\mathrm{cov}(\bar x_A, \bar y_B) = -\frac{1}{n-1}\,\mathrm{cov}(x, y)$.



Here, x, y = a, b, c or z. Likewise, A in (i)-(iii) may be replaced by B or
C. And A, B in (iv) may be replaced by any other distinct pair of sets. By
cov(x, y), for example, we mean

$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y).$

Curiously, the result in (iv) does not depend on the fractions of subjects
allocated to the three sets. We can take x = z and y = z. For instance,

$\mathrm{cov}(\bar z_A, \bar z_B) = -\frac{1}{n-1}\,\mathrm{var}(z).$
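Proposition 1 can be checked by brute force on a small population. The following sketch is not from the paper: the helper names (`assignments`, `pop_cov`, `check_proposition_1`) and the toy data are assumptions made for illustration. It enumerates every equally likely split into A, B, C, uses exact rational arithmetic, and compares the resulting moments with claims (i), (iii) and (iv); claim (ii) is the case x = y of (iii).

```python
from fractions import Fraction
from itertools import combinations


def assignments(n, n_a, n_b):
    """Yield every split of subjects 0..n-1 into sets A, B, C of fixed sizes."""
    subjects = range(n)
    for a_set in combinations(subjects, n_a):
        rest = [i for i in subjects if i not in a_set]
        for b_set in combinations(rest, n_b):
            c_set = tuple(i for i in rest if i not in b_set)
            yield a_set, b_set, c_set


def pop_cov(x, y):
    """Population covariance with divisor n, as in the text."""
    n = len(x)
    mx, my = Fraction(sum(x), n), Fraction(sum(y), n)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n


def check_proposition_1(x, y, n_a, n_b):
    """Compare exact moments over all assignments with claims (i), (iii), (iv)."""
    n = len(x)
    draws = [
        (Fraction(sum(x[i] for i in a_set), n_a),   # average of x over A
         Fraction(sum(y[i] for i in a_set), n_a),   # average of y over A
         Fraction(sum(y[i] for i in b_set), n_b))   # average of y over B
        for a_set, b_set, _ in assignments(n, n_a, n_b)
    ]
    m = len(draws)
    e_xa = sum(d[0] for d in draws) / m
    e_ya = sum(d[1] for d in draws) / m
    e_yb = sum(d[2] for d in draws) / m
    cov_aa = sum(d[0] * d[1] for d in draws) / m - e_xa * e_ya
    cov_ab = sum(d[0] * d[2] for d in draws) / m - e_xa * e_yb
    p_a = Fraction(n_a, n)
    return {
        "(i)": e_xa == Fraction(sum(x), n),
        "(iii)": cov_aa == (1 - p_a) / p_a * pop_cov(x, y) / (n - 1),
        "(iv)": cov_ab == -pop_cov(x, y) / (n - 1),
    }


# Toy integer-valued population (hypothetical data).
x = [0, 1, 3, 4, 7]
y = [2, 0, 5, 1, 1]
print(check_proposition_1(x, y, n_a=2, n_b=1))
```

The inputs should be integers (or `Fraction` objects) so that the comparisons are exact rather than subject to floating-point error.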

The finite-sample multivariate CLT in Theorem 1 below is a minor vari- 
ation on results in Hoglund (1978). The theorem will be used to prove the 
asymptotic normality of the multiple regression estimator. There are several 
regularity conditions for the theorem. 

Condition #1. There is an a priori bound on fourth moments. For all
n = 1, 2, ... and x = a, b, c or z,

(5)  $\frac{1}{n}\sum_{i=1}^{n} |x_i|^4 < L < \infty.$

Condition #2. The first- and second-order moments, including mixed
moments, converge to finite limits, and asymptotic variances are positive.
For instance,

(6)  $\frac{1}{n}\sum_{i=1}^{n} a_i \to \langle a \rangle$

and

(7)  $\frac{1}{n}\sum_{i=1}^{n} a_i^2 \to \langle a^2 \rangle, \quad \frac{1}{n}\sum_{i=1}^{n} a_i b_i \to \langle ab \rangle,$

with

(8)  $\langle a^2 \rangle > \langle a \rangle^2;$

likewise for the other variables and pairs of variables. Here, $\langle a \rangle$ and so forth
merely denote finite limits. We take $\langle a^2 \rangle$ and $\langle a, a \rangle$ as synonymous, and
likewise $\langle ab \rangle$ and $\langle a, b \rangle$. In present notation, $\langle a \rangle$ is the limit of $\bar a$, the
latter being the average of a over the population of size n; see (1).

Condition #3. We assume groups are of order n in size, that is,

(9)  $\tilde p_A = n_A/n \to p_A > 0, \quad \tilde p_B = n_B/n \to p_B > 0, \quad \tilde p_C = n_C/n \to p_C > 0,$

where $p_A + p_B + p_C = 1$. Notice that $\tilde p_A$, for instance, is the fraction of
subjects assigned to A at stage n; the limit as n increases is $p_A$.

Condition #4. The variables a, b, c, z have mean 0:

(10)  $\frac{1}{n}\sum_{i=1}^{n} x_i = 0$, where x = a, b, c, z.

Condition #4 is a normalization for Theorem 1. Without it, some centering
would be needed.

Theorem 1 (The CLT). Under Conditions #1-#4, the joint distribution
of the 12-vector

$\sqrt{n}\,(\bar a_A, \bar a_B, \bar a_C, \ldots, \bar z_C)$

is asymptotically normal, with parameters given by the limits below:

(i) $E(\sqrt{n}\,\bar x_A) = 0$;

(ii) $\mathrm{var}(\sqrt{n}\,\bar x_A) \to \langle x^2 \rangle (1 - p_A)/p_A$;

(iii) $\mathrm{cov}(\sqrt{n}\,\bar x_A, \sqrt{n}\,\bar y_A) \to \langle x, y \rangle (1 - p_A)/p_A$;

(iv) $\mathrm{cov}(\sqrt{n}\,\bar x_A, \sqrt{n}\,\bar y_B) \to -\langle x, y \rangle$.

Here, x, y = a, b, c or z. Likewise, A in (i)-(iii) may be replaced by B or
C. And A, B in (iv) may be replaced by any other distinct pair of sets. The
theorem asserts, among other things, that the limiting first- and second-order
moments coincide with the moments of the asymptotic distribution,
which is safe due to the bound on fourth moments. (As noted above, proofs
are deferred to a Technical Appendix at the end of the article.)

Example 1. Suppose we wish to estimate the effect of C relative to A,
that is, $\bar c - \bar a$. The ITT estimator is $\bar Y_C - \bar Y_A = \bar c_C - \bar a_A$, where the equality
follows from (3). As before, $\bar Y_C = \sum_{i \in C} Y_i/n_C = \sum_{i \in C} c_i/n_C$. The estimator
$\bar Y_C - \bar Y_A$ is unbiased by Proposition 1, and its exact variance is

$\frac{1}{n-1}\left[\frac{1-\tilde p_A}{\tilde p_A}\,\mathrm{var}(a) + \frac{1-\tilde p_C}{\tilde p_C}\,\mathrm{var}(c) + 2\,\mathrm{cov}(a, c)\right],$

where $\tilde p_S = n_S/n$ as in Proposition 1.

By contrast, the multiple regression estimator would be obtained by fitting
(4) to the data, and computing $\hat\Delta = \hat\beta_3 - \hat\beta_1$. The asymptotic bias and
variance of this estimator will be determined in Theorem 2 below. The
performance of the two estimators will be compared in Theorem 4.
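The exact variance in Example 1 can be confirmed by enumeration. The sketch below is illustrative only: the function names and the six-subject population are assumptions, not taken from the paper. It lists every choice of the sets A and C (the remaining subjects form B), computes the ITT difference for each, and compares the exact mean and variance with the displayed formula.

```python
from fractions import Fraction
from itertools import combinations


def pop_cov(x, y):
    """Population covariance with divisor n, as in the text."""
    n = len(x)
    mx, my = Fraction(sum(x), n), Fraction(sum(y), n)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n


def itt_difference_moments(a, c, n_a, n_c):
    """Exact mean and variance of (average of c over C) - (average of a over A)."""
    n = len(a)
    values = []
    for a_set in combinations(range(n), n_a):
        rest = [i for i in range(n) if i not in a_set]
        for c_set in combinations(rest, n_c):
            c_bar = Fraction(sum(c[i] for i in c_set), n_c)
            a_bar = Fraction(sum(a[i] for i in a_set), n_a)
            values.append(c_bar - a_bar)
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def itt_variance_formula(a, c, n_a, n_c):
    """Variance of the ITT difference according to the formula in Example 1."""
    n = len(a)
    p_a, p_c = Fraction(n_a, n), Fraction(n_c, n)
    return ((1 - p_a) / p_a * pop_cov(a, a)
            + (1 - p_c) / p_c * pop_cov(c, c)
            + 2 * pop_cov(a, c)) / (n - 1)


# Hypothetical potential responses for six subjects.
a = [0, 1, 3, 4, 7, 2]
c = [2, 0, 5, 1, 1, 9]
mean, var = itt_difference_moments(a, c, n_a=2, n_c=3)
print(mean, var, itt_variance_formula(a, c, n_a=2, n_c=3))
```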



2. Asymptotics for multiple regression estimators. In this section we
state a theorem that describes the asymptotic behavior of the multiple
regression estimator applied to experimental data: there is a random term of
order $1/\sqrt{n}$ and a bias term of order 1/n. As noted above, we have three
treatments and one covariate z. The treatment groups are A, B, C, with
dummies U, V, W. The covariate is z. If i is assigned to A, we observe the
response $a_i$, whereas $b_i$, $c_i$ remain unobserved. Likewise for B, C. The
covariate $z_i$ is always observed, and is unaffected by assignment. The response
variable Y is given by (3). In Theorem 1, most of the random variables (like
$\bar a_B$ or $\bar b_A$) are unobservable. That may affect the applications, but not the
mathematics. Arguments below involve only observable random variables.

The design matrix for the multiple regression estimator will have n rows
and four columns, namely, U, V, W, z. The estimator is obtained by a regression
of Y on U, V, W, z, the first three coefficients estimating the effects of
A, B, C, respectively. Let $\hat\beta_{MR}$ be the multiple regression estimator for the
effects of A, B, C. Thus, $\hat\beta_{MR}$ is a 3 x 1 vector.
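For concreteness, the regression just described can be sketched for a single random assignment. Everything in this block is an illustrative assumption: the simulated potential responses, the group sizes and the variable names are invented, and NumPy's `lstsq` is used simply as one convenient least-squares routine. The block builds the dummies, forms the observed response as the sum of each potential response times its dummy, and regresses Y on U, V, W, z without an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical potential responses and covariate for n = 12 subjects.
n, n_a, n_b = 12, 4, 4
z = rng.normal(size=n)
z = (z - z.mean()) / z.std()          # mean 0 and mean square 1
a = 1.0 + z + rng.normal(size=n)
b = a + 0.5 * z**2                    # non-additive on purpose
c = 2.0 - z + rng.normal(size=n)

# Random assignment with fixed group sizes.
order = rng.permutation(n)
U = np.zeros(n)
U[order[:n_a]] = 1
V = np.zeros(n)
V[order[n_a:n_a + n_b]] = 1
W = 1 - U - V

Y = a * U + b * V + c * W             # observed response: one potential response per subject
X = np.column_stack([U, V, W, z])     # n x 4 design matrix, no intercept
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

beta_mr = beta_hat[:3]                # multiple regression estimates of the three effects
itt = np.array([Y[U == 1].mean(), Y[V == 1].mean(), Y[W == 1].mean()])
print("MR :", beta_mr)
print("ITT:", itt)
```

In the Neyman model the only randomness is in `order`; rerunning with a different permutation while holding `a`, `b`, `c`, `z` fixed is what generates the sampling distribution studied in this section.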

We normalize z to have mean 0 and variance 1:

(11)  $\frac{1}{n}\sum_{i=1}^{n} z_i = 0, \quad \frac{1}{n}\sum_{i=1}^{n} z_i^2 = 1.$

The mean-zero condition on z overlaps Condition #4, and is needed for
Theorem 2. There is no intercept in our regression model; without the
mean-zero condition, the mean of z is liable to confound the effect estimates.
See the Appendix for details. (In the alternative, we can drop one of the
dummies and put an intercept into the regression, although we would now
be estimating effect differences rather than effects.) The condition on the
mean of $z^2$ merely sets the scale.

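Recall that $\tilde p_A$ is the fraction of subjects assigned to treatment A. Let

(12)  $\tilde Q = \tilde p_A\,\overline{az} + \tilde p_B\,\overline{bz} + \tilde p_C\,\overline{cz}$

and

(13)  $Q = p_A \langle az \rangle + p_B \langle bz \rangle + p_C \langle cz \rangle.$

Here, for instance, $\overline{az} = \sum_{i=1}^{n} a_i z_i/n$ is the average over the study population.
By Condition #2, as the population size grows, $\overline{az} = \sum_{i=1}^{n} a_i z_i/n \to \langle az \rangle$;
likewise for b and c. Thus,

(14)  $\tilde Q \to Q.$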

The quantities $\tilde Q$ and Q are needed for the next theorem, which demonstrates
asymptotic normality and isolates the bias term. To state the theorem,
recall that $\hat\beta_{MR}$ is the multiple regression estimator for the three effects.
The estimand is

(15)  $\beta = (\bar a, \bar b, \bar c)',$

where $\bar a, \bar b, \bar c$ are defined in (1). Define the 3 x 3 matrix $\Sigma$ as follows:

(16)  $\Sigma_{11} = \frac{1-p_A}{p_A}\,\lim \mathrm{var}(a - Qz), \quad \Sigma_{12} = -\lim \mathrm{cov}(a - Qz,\ b - Qz),$

and so forth. The limits are taken as the population size $n \to \infty$, and exist
by Condition #2. Let

(17)  $\zeta_n = \sqrt{n}\,(\bar a_A - \tilde Q \bar z_A,\ \bar b_B - \tilde Q \bar z_B,\ \bar c_C - \tilde Q \bar z_C)'.$

This turns out to be the lead random element in $\hat\beta_{MR} - \beta$. The asymptotic
variance-covariance matrix of $\zeta_n$ is $\Sigma$, by (14) and Theorem 1. For the bias
term, let

(18)  $K_A = \mathrm{cov}(az, z) - \tilde p_A\,\mathrm{cov}(az, z) - \tilde p_B\,\mathrm{cov}(bz, z) - \tilde p_C\,\mathrm{cov}(cz, z),$

and likewise for $K_B$, $K_C$.

Theorem 2. Assume Conditions #1-#3, not Condition #4, and (11).
Define $\zeta_n$ by (17), and $K_S$ by (18) for S = A, B, C. Then $E(\zeta_n) = 0$ and $\zeta_n$
is asymptotically $N(0, \Sigma)$. Moreover,

(19)  $\hat\beta_{MR} - \beta = \zeta_n/\sqrt{n} - K/n + \rho_n,$

where $K = (K_A, K_B, K_C)'$ and $\rho_n = O(1/n^{3/2})$ in probability.

Remarks. (i) If K = 0, the bias term will be $O(1/n^{3/2})$ or smaller.

(ii) What are the implications for practice? In the usual linear model, $\hat\beta$
is unbiased given X. With experimental data and the Neyman model, given
the assignment, results are deterministic. At best, we will get unbiasedness
on average, over all assignments. Under special circumstances (Theorems
5 and 6 below), that happens. Generally, however, the multiple regression
estimator will be biased. See Example 5. The bias decreases as sample size
increases.

(iii) Turn now to random error in $\hat\beta$. This is of order $1/\sqrt{n}$, both for
the ITT estimator and for the multiple regression estimator. However, the
asymptotic variances differ. The multiple regression estimator can be more
efficient than the ITT estimator, or less efficient, and the difference persists
even for large samples. See Examples 3 and 4 below.

3. Asymptotic nominal variances. "Nominal" variances are computed
by the usual regression formulae, but are likely to be wrong since the usual
assumptions do not hold. We sketch the asymptotics here, under the conditions
of Theorem 2. Recall that the design matrix X is n x 4, the columns
being U, V, W, z. The response variable is Y. The nominal covariance matrix
is then

(20)  $\hat\Sigma_{\mathrm{nom}} = \hat\sigma^2 (X'X)^{-1},$

where $\hat\sigma^2$ is the sum of the squared residuals, normalized by the degrees of
freedom (n - 4). Recall Q from (13). Let

(21)  $\sigma^2 = \lim\,[\tilde p_A\,\mathrm{var}(a) + \tilde p_B\,\mathrm{var}(b) + \tilde p_C\,\mathrm{var}(c)] - Q^2,$

where the limit exists by Conditions #2 and #3. Let

(22)  $D = \begin{pmatrix} p_A & 0 & 0 & 0 \\ 0 & p_B & 0 & 0 \\ 0 & 0 & p_C & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$



Theorem 3. Assume Conditions #1-#3, not Condition #4, and (11).
Define $\sigma^2$ by (21) and D by (22). In probability,

(i) $X'X/n \to D$,

(ii) $\hat\sigma^2 \to \sigma^2$,

(iii) $n\hat\Sigma_{\mathrm{nom}} \to \sigma^2 D^{-1}$.

What are the implications for practice? The upper left 3 x 3 block of
$\sigma^2 D^{-1}$ will generally differ from $\Sigma$ in Theorem 2, so the usual regression
standard errors, computed for experimental data, can be quite misleading.
This difficulty does not go away for large samples. What explains the
breakdown? In brief, the multiple regression assumes (i) the expectation
of the response given the assignment variables and the covariates is linear,
with coefficients that are constant across subjects; and (ii) the conditional
variance of the response is constant across subjects. In the Neyman model,
(i) is wrong as noted earlier. Moreover, given the assignments, there is no
variance left in the responses.

More technically, variances in the Neyman model are (necessarily) com- 
puted across the assignments, for it is the assignments that are the random 
elements in the model. With regression, variances are computed condition- 
ally on the assignments, from an error term assumed to be IID across sub- 
jects, and independent of the assignment variables as well as the covariates. 
These assumptions do not follow from the randomization, explaining why the 
usual formulas break down. For additional discussion, see Freedman (2007). 

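An example may clarify the issues. Write $\mathrm{cov}_\infty$ for limiting covariances,
for example,

$\mathrm{cov}_\infty(a, z) = \lim \mathrm{cov}(a, z) = \langle az \rangle - \langle a \rangle \langle z \rangle = \langle az \rangle$

because $\langle z \rangle = 0$ by (11); similarly for variances. See Condition #2.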

Example 2. Consider estimating the effect of C relative to A, so the
parameter of interest is $\bar c - \bar a$. By way of simplification, suppose Q = 0. Let $\hat\Delta$
be the multiple regression estimator for the effect difference. By Theorem 3,
the nominal variance of $\hat\Delta$ is essentially 1/n times

$\left(1 + \frac{p_A}{p_C}\right)\mathrm{var}_\infty(a) + \left(1 + \frac{p_C}{p_A}\right)\mathrm{var}_\infty(c) + \left(\frac{1}{p_A} + \frac{1}{p_C}\right) p_B\,\mathrm{var}_\infty(b).$

By Theorem 2, however, the true asymptotic variance of $\hat\Delta$ is 1/n times

$\left(\frac{1}{p_A} - 1\right)\mathrm{var}_\infty(a) + \left(\frac{1}{p_C} - 1\right)\mathrm{var}_\infty(c) + 2\,\mathrm{cov}_\infty(a, c).$

For instance, we can take the asymptotic variance-covariance matrix of
a, b, c, z to be the 4 x 4 identity matrix, with $p_A = p_C = 1/4$ so $p_B = 1/2$.
The true asymptotic variance of $\hat\Delta$ is 6/n. The nominal asymptotic variance
is 8/n and is too big. On the other hand, if we change $\mathrm{var}_\infty(b)$ to 1/4, the
true asymptotic variance is still 6/n; the nominal asymptotic variance drops
to 5/n and is too small.
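The arithmetic in Example 2 is easy to reproduce. The two helper functions below are assumptions introduced for illustration (the paper gives only the formulas): the first multiplies the limiting residual variance from (21), with Q = 0, by the relevant entries of the inverse of D; the second evaluates the true asymptotic variance for the case Q = 0.

```python
from fractions import Fraction as F


def nominal_variance_times_n(p_a, p_b, p_c, var_a, var_b, var_c):
    """n * nominal variance of the C-versus-A contrast when Q = 0."""
    sigma2 = p_a * var_a + p_b * var_b + p_c * var_c   # (21) with Q = 0
    return sigma2 * (1 / p_a + 1 / p_c)                # contrast (-1, 0, 1, 0) applied to D^{-1}


def true_variance_times_n(p_a, p_c, var_a, var_c, cov_ac):
    """n * true asymptotic variance of the same contrast when Q = 0."""
    return (1 / p_a - 1) * var_a + (1 / p_c - 1) * var_c + 2 * cov_ac


p_a, p_b, p_c = F(1, 4), F(1, 2), F(1, 4)
print(true_variance_times_n(p_a, p_c, 1, 1, 0))              # true value
print(nominal_variance_times_n(p_a, p_b, p_c, 1, 1, 1))      # identity covariance matrix
print(nominal_variance_times_n(p_a, p_b, p_c, 1, F(1, 4), 1))  # var(b) changed to 1/4
```

With these inputs the three printed values are 6, 8 and 5, matching the figures quoted in the example.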

4. The gain from adjustment. Does adjustment improve precision? The
answer is sometimes.

Theorem 4. Assume Conditions #1-#3, not Condition #4, and (11).
Consider estimating the effect of C relative to A, so the parameter of
interest is $\bar c - \bar a$. If we compare the multiple regression estimator to the ITT
estimator, the asymptotic gain in variance is $\Gamma/(n p_A p_C)$, where

(23)  $\Gamma = 2Q\,[p_C \langle az \rangle + p_A \langle cz \rangle] - Q^2\,[p_A + p_C],$

with Q defined by (13). Adjustment therefore helps asymptotic precision if
$\Gamma > 0$, but hurts if $\Gamma < 0$.
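The formula for the gain can be cross-checked against the two asymptotic variances it compares. In the sketch below, the function names and the numerical moments are assumptions chosen for illustration; the limiting moments are taken with z normalized to mean 0 and variance 1, so that var(a - Qz) = var(a) - 2Q<az> + Q^2. The ITT variance uses the limiting form of Example 1 and the multiple regression variance uses the entries of the limiting covariance matrix in Theorem 2.

```python
from fractions import Fraction as F


def q_limit(p, az, bz, cz):
    """Q as in (13); p = (p_A, p_B, p_C), and az, bz, cz stand for <az>, <bz>, <cz>."""
    return p[0] * az + p[1] * bz + p[2] * cz


def gamma(p, az, bz, cz):
    """Gamma as in (23)."""
    q = q_limit(p, az, bz, cz)
    return 2 * q * (p[2] * az + p[0] * cz) - q**2 * (p[0] + p[2])


def variance_gain_times_n(p, var_a, var_c, cov_ac, az, bz, cz):
    """n * (ITT variance - MR variance) for the C-versus-A contrast, computed directly."""
    q = q_limit(p, az, bz, cz)
    w_a, w_c = (1 - p[0]) / p[0], (1 - p[2]) / p[2]
    itt = w_a * var_a + w_c * var_c + 2 * cov_ac
    mr = (w_a * (var_a - 2 * q * az + q**2)            # var(a - Qz)
          + w_c * (var_c - 2 * q * cz + q**2)          # var(c - Qz)
          + 2 * (cov_ac - q * az - q * cz + q**2))     # cov(a - Qz, c - Qz)
    return itt - mr


# Hypothetical design fractions and limiting moments.
p = (F(1, 2), F(1, 4), F(1, 4))
moments = dict(var_a=2, var_c=3, cov_ac=F(1, 2), az=F(1, 3), bz=F(-1, 5), cz=F(3, 4))
direct = variance_gain_times_n(p, **moments)
via_gamma = gamma(p, moments["az"], moments["bz"], moments["cz"]) / (p[0] * p[2])
print(direct == via_gamma)
```

The agreement is an algebraic identity, so it holds for any inputs; the particular numbers only serve to exercise the formulas.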

The next two examples are set up like Example 2, with $\mathrm{cov}_\infty$ for limiting
covariances. We say the design is balanced if n is a multiple of 3 and
$n_A = n_B = n_C = n/3$. We say that effects are additive if $b_i - a_i$ is constant
over i and likewise for $c_i - a_i$. With additive effects, $\mathrm{var}_\infty(a) = \mathrm{var}_\infty(b) =
\mathrm{var}_\infty(c)$; write v for the common value. Similarly, $\mathrm{cov}_\infty(a, z) = \mathrm{cov}_\infty(b, z) =
\mathrm{cov}_\infty(c, z) = Q = \rho\sqrt{v}$, where $\rho$ is the asymptotic correlation between a and
z, or b and z, or c and z.

Example 3. Suppose effects are additive. Then $\mathrm{cov}_\infty(a, z) = \mathrm{cov}_\infty(b, z) =
\mathrm{cov}_\infty(c, z) = Q$ and $\Gamma = Q^2 (p_A + p_C) \ge 0$. The asymptotic gain from
adjustment will be positive if $\mathrm{cov}_\infty(a, z) \ne 0$.

Example 4. Suppose the design is balanced, so $p_A = p_B = p_C = 1/3$.
Then $3Q = \mathrm{cov}_\infty(a, z) + \mathrm{cov}_\infty(b, z) + \mathrm{cov}_\infty(c, z)$. Consequently,
$3\Gamma/2 = Q\,[2Q - \mathrm{cov}_\infty(b, z)]$. Let z = a + b + c. Choose a, b, c so that
$\mathrm{var}_\infty(z) = 1$ and $\mathrm{cov}_\infty(a, b) = \mathrm{cov}_\infty(a, c) = \mathrm{cov}_\infty(b, c) = 0$. In particular,
Q = 1/3. Now $2Q - \mathrm{cov}_\infty(b, z) = 2/3 - \mathrm{var}_\infty(b)$. The asymptotic gain from
adjustment will be negative if $\mathrm{var}_\infty(b) > 2/3$.

Example 3 indicates one motivation for adjustment: if effects are nearly
additive, adjustment is likely to help. However, Example 4 shows that even in
a balanced design, the "gain" from adjustment can be negative if there are
subject-by-treatment interactions. More complicated and realistic examples
can no doubt be constructed.
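The sign change in Example 4 can be made concrete with a few lines of arithmetic. The function below is a hypothetical helper that hard-codes the setup of that example: a balanced design, z = a + b + c with a, b, c uncorrelated and var(z) = 1, so that cov(x, z) = var(x) for x = a, b, c and var(a) + var(c) = 1 - var(b). It evaluates the gain quantity of Theorem 4 as a function of var(b).

```python
from fractions import Fraction as F


def gamma_example_4(var_b):
    """Gain quantity of Theorem 4 under the setup of Example 4."""
    p = F(1, 3)                    # balanced design: p_A = p_B = p_C = 1/3
    q = p * 1                      # Q = (var(a) + var(b) + var(c)) / 3 = 1/3
    sum_az_cz = 1 - var_b          # <az> + <cz> = var(a) + var(c)
    return 2 * q * p * sum_az_cz - q**2 * 2 * p


for var_b in (F(1, 2), F(2, 3), F(3, 4)):
    print(var_b, gamma_example_4(var_b))
```

The gain is positive for var(b) = 1/2, zero at the threshold 2/3, and negative for var(b) = 3/4, in line with the example.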

5. Finite-sample results. This section gives conditions under which the
multiple regression estimator will be exactly unbiased in finite samples.
Arguments are from symmetry. As before, the design is balanced if n is a
multiple of 3 and $n_A = n_B = n_C = n/3$; effects are additive if $b_i - a_i$ is
constant over i and likewise for $c_i - a_i$. Then $a_i - \bar a = b_i - \bar b = c_i - \bar c = \delta_i$, say,
for all i. Note that $\sum_i \delta_i = 0$.

Theorem 5. If (11) holds, the design is balanced, and effects are addi- 
tive, then the multiple regression estimator is unbiased. 
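Theorem 5 can be illustrated by exact enumeration on a tiny balanced trial. The code below is a sketch under stated assumptions: the six-subject population is invented, `ols` is a small exact solver written for this illustration (normal equations with rational arithmetic), and `expected_mr_estimate` averages the fitted coefficients over all equally likely assignments. The covariate is chosen with mean 0 and mean square 1, and with distinct values so that every design matrix has full rank.

```python
from fractions import Fraction as F
from itertools import combinations


def ols(X, y):
    """Exact least squares: solve X'X b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    M = [[F(sum(r[j] * r[k] for r in X)) for k in range(p)]
         + [F(sum(r[j] * yi for r, yi in zip(X, y)))] for j in range(p)]
    for col in range(p):
        piv = next(r for r in range(col, p) if M[r][col] != 0)  # fails if X'X is singular
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[j][p] for j in range(p)]


def expected_mr_estimate(a, b, c, z, n_a, n_b):
    """Average the regression coefficients over all equally likely assignments."""
    n = len(z)
    total, count = [F(0)] * 4, 0
    for a_set in combinations(range(n), n_a):
        rest = [i for i in range(n) if i not in a_set]
        for b_set in combinations(rest, n_b):
            X, y = [], []
            for i in range(n):
                u, v = int(i in a_set), int(i in b_set)
                w = 1 - u - v
                X.append([u, v, w, z[i]])
                y.append(a[i] * u + b[i] * v + c[i] * w)   # observed response
            total = [t + e for t, e in zip(total, ols(X, y))]
            count += 1
    return [t / count for t in total]


# Hypothetical population: additive effects, balanced design, normalized z.
z = [F(-7, 5), F(-1), F(-1, 5), F(1, 5), F(1), F(7, 5)]
a = [0, 1, 5, 2, 2, 8]
b = [ai + 3 for ai in a]
c = [ai - 2 for ai in a]
estimate = expected_mr_estimate(a, b, c, z, n_a=2, n_b=2)
print(estimate[:3], [F(sum(v), 6) for v in (a, b, c)])
```

The first three averaged coefficients coincide with the population averages of a, b, c, as the theorem asserts; changing the group sizes or breaking additivity in the inputs is a quick way to explore when that stops being true.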

Examples show that the balance condition is needed in Theorem 5: ad- 
ditivity is not enough. Likewise, if the balance condition holds but there is 
nonadditivity, the multiple regression estimator will usually be biased. We 
illustrate the first point. 

Example 5. Consider a miniature trial with 6 subjects. Responses a, b, c 
to treatments A,B,C are shown in Table 1, along with the covariate z. 
Notice that b — a = 1 and c — a = 2. Thus, effects are additive. We assign 
one subject at random to A, one to B, and the remaining four to C. There 
are 6 x 5/2 = 15 assignments. For each assignment, we build up the 6x4 
design matrix (one column for each treatment dummy and one column for 
z); we compute the response variable from Table 1 above, and then the 
multiple regression estimator. Finally, we average the results across the 15 
assignments, as shown in Table 2. The average gives the expected value of 
the multiple regression estimator, because the average is taken across all 
possible designs. "Truth" is determined from the parameters in Table 1. 
Calculations are exact, within the limits of rounding error; no simulations 
are involved. 

For instance, the average coefficient for the A dummy is 3.3825. However, 
from Table 1, the average effect of A is a = 1.3333. The difference is bias. 
Consider next the differential effect of B versus A. On average, this is esti- 
mated by multiple regression as 1.9965 — 3.3825 = —1.3860. From Table 1, 
truth is +1. Again, this reflects bias in the multiple regression estimator. 
With a larger trial, of course, the bias would be smaller; see Theorem 2. 
Theorem 5 does not apply because the design is unbalanced.

Table 1
Parameter values

  a     b     c     z
  0     1     2     0
  0     1     2     0
  0     1     2     0
  2     3     4    -2
  2     3     4    -2
  4     5     6     4

For the next theorem, consider the possible values v of z. Let $n_v$ be the
number of i with $z_i = v$. The average of $a_i$ given $z_i = v$ is

$\frac{1}{n_v}\sum_{\{i:\ z_i = v\}} a_i.$

Suppose this is constant across v's, as are $\sum_{\{i:\ z_i = v\}} b_i/n_v$ and $\sum_{\{i:\ z_i = v\}} c_i/n_v$.
The common values must be $\bar a, \bar b, \bar c$, respectively. We call this conditional
constancy. No condition is imposed on z, and the design need not be balanced.
(Conditional constancy is violated in Example 5, as one sees by looking
at the parameter values in Table 1.)

Theorem 6. With conditional constancy, the multiple regression esti- 
mator is unbiased. 
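Theorem 6 can likewise be illustrated by exact enumeration. In this sketch the population is invented so that the averages of a, b, c are the same at each of the two values of z (conditional constancy), while effects are not additive, the design is unbalanced, and z is not centered. The helpers `ols` and `expected_mr_estimate` are assumptions written for this illustration; group C always contains both values of z, so every design matrix has full rank.

```python
from fractions import Fraction as F
from itertools import combinations


def ols(X, y):
    """Exact least squares: solve X'X b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    M = [[F(sum(r[j] * r[k] for r in X)) for k in range(p)]
         + [F(sum(r[j] * yi for r, yi in zip(X, y)))] for j in range(p)]
    for col in range(p):
        piv = next(r for r in range(col, p) if M[r][col] != 0)  # fails if X'X is singular
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[j][p] for j in range(p)]


def expected_mr_estimate(a, b, c, z, n_a, n_b):
    """Average the regression coefficients over all equally likely assignments."""
    n = len(z)
    total, count = [F(0)] * 4, 0
    for a_set in combinations(range(n), n_a):
        rest = [i for i in range(n) if i not in a_set]
        for b_set in combinations(rest, n_b):
            X, y = [], []
            for i in range(n):
                u, v = int(i in a_set), int(i in b_set)
                w = 1 - u - v
                X.append([u, v, w, z[i]])
                y.append(a[i] * u + b[i] * v + c[i] * w)   # observed response
            total = [t + e for t, e in zip(total, ols(X, y))]
            count += 1
    return [t / count for t in total]


# Hypothetical population with conditional constancy: within each z level,
# a averages 1, b averages 2 and c averages 2.
z = [0, 0, 0, 2, 2, 2]
a = [0, 1, 2, 1, 1, 1]
b = [3, 0, 3, 2, 2, 2]
c = [5, 1, 0, 2, 3, 1]
estimate = expected_mr_estimate(a, b, c, z, n_a=1, n_b=1)
print(estimate[:3])
```

One subject goes to A, one to B and four to C, so the balance condition of Theorem 5 fails; the averaged coefficients still equal the population averages (1, 2, 2) because of conditional constancy.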

Remarks. (i) In the usual regression model, $Y = X\beta + \epsilon$ with $E(\epsilon \mid X) = 0$.
The multiple regression estimator is then conditionally unbiased. In Theorems
5 and 6, the estimator is conditionally biased, although the bias averages
out to 0 across permutations. In Theorem 5, for instance, the conditional
bias is $(X'X)^{-1}X'\delta$. Across permutations, the bias averages out to 0. The
proof is a little tricky (see the Technical Appendix below). The $\delta$ is fixed, as
explained before the theorem; it is X that varies from one permutation to
another; the conditional bias is a nonlinear function of X. This is all quite
different from the usual regression arguments.



Table 2
Average multiple regression estimates versus truth

         Ave MR     Truth
  A      3.3825     1.3333
  B      1.9965     2.3333
  C      2.9053     3.3333
  z     -0.0105






(ii) Kempthorne (1952) points to the difference between permutation 
models and the usual linear regression model; see Chapters 7-8, especially 
Section 8.7. Also see Biometrics vol. 13, no. 3 (1957). Cox (1956) cites 
Kempthorne, but appears to contradict Theorem 5 above. I am indebted to 
Joel Middleton for the reference to Cox. 

(iii) When specialized to two-group experiments, the formulas in this
paper (for, e.g., asymptotic variances) differ in appearance but not in sub- 
stance from those previously reported [Freedman (2007)]. 

(iv) Although details have not been checked, the results (and the argu- 
ments) in this paper seem to extend easily to any fixed number of treatments, 
and any fixed number of covariates. Treatment by covariate interactions can 
probably be accommodated too. 

(v) In this paper treatments have two levels: low or high. If a treatment 
has several levels — for example, low, medium, high — and linearity is assumed 
in a regression model, inconsistency is likely to be a consequence. Likewise, 
we view treatments as mutually exclusive: if subject i is assigned to group A, 
then i cannot also turn up in group B. If multiple treatments are applied to 
the same subject in order to determine joint effects, and a regression model 
assumes additive or multiplicative effects, inconsistency is again likely. 

(vi) The theory developed here applies equally well to 0-1 valued re- 
sponses. With 0-1 variables, it may seem more natural to use logit or pro- 
bit models to adjust the data. However, such models are not justified by 
randomization — any more than the linear model. Preliminary calculations 
suggest that if adjustments are to be made, linear regression may be a safer 
choice. For instance, the conventional logit estimator for the odds ratio may 
be severely biased. On the other hand, a consistent estimator can be based 
on estimated probabilities in the logit model. For discussion, see Freedman 
(2008). 

(vii) The theory developed here can probably be extended to more com- 
plex designs (like blocking) and more complex estimators (like two-stage 
least squares), but the work remains to be done. 

(viii) Victora, Habicht and Bryce (2004) favor adjustment. However, they 
do not address the sort of issues raised here, nor are they entirely clear 
about whether inferences are to be made on average across assignments, or 
conditional on assignment. In the latter case, inferences might be strongly 
model-dependent. 

(ix) Models are used to adjust data from large randomized controlled 
experiments in, for example, Cook et al. (2007), Gertler (2004), Chattopad- 
hyay and Duflo (2004) and Rossouw et al. (2002). Cook et al. report on long- 
term followup of subjects in experiments where salt intake was restricted; 
conclusions are dependent on the models used to analyze the data. By con- 
trast, the results in Rossouw et al. for hormone replacement therapy do not 
depend very much on the modeling. 






6. Recommendations for practice. Altman et al. (2001) document per- 
sistent failures in the reporting of data from clinical trials, and make detailed 
proposals for improvement. The following recommendations are complemen- 
tary: 

(i) As is usual, measures of balance between the assigned-to-treatment 
group and the assigned-to-control group should be reported. 

(ii) After that should come a simple intention-to-treat analysis, com- 
paring rates (or averages and SDs) of outcomes among those assigned to 
treatment and those assigned to the control group. 

(iii) Crossover should be discussed, and deviations from protocol.

(iv) Subgroup analyses should be reported, and corrections for crossover 
if that is to be attempted. Analysis by treatment received requires special 
justification, and so does per protocol analysis. (The first compares those 
who receive treatment with those who do not, regardless of assignment; the 
second censors subjects who cross over from one arm of the trial to the other, 
e.g., they are assigned to control but insist on treatment.) Complications are 
discussed in Freedman (2006). 

(v) Regression estimates (including logistic regression and proportional 
hazards) should be deferred until rates and averages have been presented. If 
regression estimates differ from simple intention-to-treat results, and reliance 
is placed on the models, that needs to be explained. As indicated above, the 
usual models are not justified by randomization, and simpler estimators may 
be more robust. 



TECHNICAL APPENDIX

The Appendix provides technical underpinnings for the theorems discussed
above.

Proof of Proposition 1. We prove only claim (iv). Plainly, $E(U_i V_j) = 0$
if i = j, since i cannot be assigned both to A and to B. Furthermore,

$E(U_i V_j) = P(U_i = 1\ \&\ V_j = 1) = \frac{n_A}{n}\,\frac{n_B}{n-1}$

if $i \ne j$. This is clear if i = 1 and j = 2; but permuting indices will not change
the joint distribution of assignment dummies. We may assume without loss
of generality that $\bar x = \bar y = 0$. Now

$\mathrm{cov}(\bar x_A, \bar y_B) = \frac{1}{n_A}\,\frac{1}{n_B}\,E\Bigl(\sum_{i \ne j} U_i V_j x_i y_j\Bigr)$

$\qquad = \frac{1}{n(n-1)}\Bigl(\sum_i x_i \sum_j y_j - \sum_i x_i y_i\Bigr)$

$\qquad = -\frac{1}{n(n-1)}\sum_i x_i y_i = -\frac{1}{n-1}\,\mathrm{cov}(x, y)$

as required, where i, j = 1, ..., n. □

Proof of Theorem 1. The theorem can be proved by appealing to
Hoglund (1978) and computing conditional distributions. Another starting
point is Hoeffding (1951), with suitable choices for the matrix from which
summands are drawn. With either approach, the usual linear-combinations
trick can be used to reduce dimensionality. In view of (9), the limiting
distribution satisfies three linear constraints.

A formal proof is omitted, but we sketch the argument for one case,
starting from Theorem 3 in Hoeffding (1951). Let $\alpha, \beta, \gamma$ be three constants.
Let M be an n x n matrix, with

$M_{ij} = \begin{cases} \alpha a_j, & \text{for } i = 1, \ldots, n_A, \\ \beta b_j, & \text{for } i = n_A + 1, \ldots, n_A + n_B, \\ \gamma c_j, & \text{for } i = n_A + n_B + 1, \ldots, n. \end{cases}$

Pick one j at random from each row, without replacement (interpretation:
if j is picked from row $i = 1, \ldots, n_A$, subject j goes into treatment group
A). According to Hoeffding's theorem, the sum of the corresponding matrix
entries will be approximately normal. So the law of $\sqrt{n}\,(\bar a_A, \bar b_B, \bar c_C)$ tends
to multivariate normal. Theorem 1 in Hoeffding's paper will help get the
regularity conditions in his Theorem 3 from Conditions #1-#4 above. □

Let X be an n x p matrix of rank p < n. Let Y be an n x 1 vector. The
multiple regression estimator computed from Y is $\hat\beta_Y = (X'X)^{-1}X'Y$. Let
$\theta$ be a p x 1 vector. The "invariance lemma" is a purely arithmetic result;
the well-known proof is omitted.

Lemma A.1. The invariance lemma. $\hat\beta_{Y + X\theta} = \hat\beta_Y + \theta$.
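The invariance lemma is easy to confirm numerically. In the sketch below the design matrix, response and shift vector are arbitrary simulated values (assumptions made purely for illustration), and NumPy's `lstsq` serves as the least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))            # any full-rank design (hypothetical data)
Y = rng.normal(size=n)
theta = np.array([1.0, -2.0, 0.5, 3.0])


def ols(X, Y):
    """Least-squares coefficients, i.e. the solution of X'X b = X'Y."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]


# Adding X @ theta to the response shifts the coefficient vector by theta.
shifted = ols(X, Y + X @ theta)
print(np.allclose(shifted, ols(X, Y) + theta))
```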

The multiple-regression estimator for Theorem 2 may be computed as follows.
Recall from (2) that $\bar Y_A$ is the average of Y over A, that is, $\sum_{i \in A} Y_i/n_A$;
likewise for B, C. Let

(A1)  $e_i = Y_i - \bar Y_A U_i - \bar Y_B V_i - \bar Y_C W_i,$

which is the residual when Y is regressed on the first three columns of the
design matrix. Let

(A2)  $f_i = z_i - \bar z_A U_i - \bar z_B V_i - \bar z_C W_i,$

which is the residual when z is regressed on those columns. Let $\hat Q$ be the
slope when e is regressed on f:

(A3)  $\hat Q = e \cdot f / |f|^2.$

The next result is standard.

Lemma A.2. The multiple regression estimator for the effect of A, that
is, the first element in $(X'X)^{-1}X'Y$, is

(A4)  $\bar Y_A - \hat Q \bar z_A$

and likewise for B, C. The coefficient of z in the regression of Y on U, V, W, z
is $\hat Q$.
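The closed form in Lemma A.2 can be compared with a direct least-squares fit. The data below are simulated and the variable names are assumptions made for illustration; the block computes the two sets of residuals and the slope exactly as defined in (A1)-(A3), assembles the estimator from (A4), and checks it against the coefficients returned by NumPy's `lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_a, n_b = 15, 5, 4

# Hypothetical data: one random assignment, a covariate, and an observed response.
order = rng.permutation(n)
U = np.zeros(n)
U[order[:n_a]] = 1
V = np.zeros(n)
V[order[n_a:n_a + n_b]] = 1
W = 1 - U - V
z = rng.normal(size=n)
Y = rng.normal(size=n) + 2 * V - W + 0.7 * z

G = np.column_stack([U, V, W])           # the three dummy columns


def group_means(x):
    """Averages of x over the sets A, B, C."""
    return (G.T @ x) / G.sum(axis=0)


Y_bar, z_bar = group_means(Y), group_means(z)
e = Y - G @ Y_bar                        # residual of Y on the dummies, (A1)
f = z - G @ z_bar                        # residual of z on the dummies, (A2)
Q_hat = (e @ f) / (f @ f)                # slope of e on f, (A3)

closed_form = np.append(Y_bar - Q_hat * z_bar, Q_hat)   # (A4) plus the z coefficient
beta_hat = np.linalg.lstsq(np.column_stack([G, z]), Y, rcond=None)[0]
print(np.allclose(closed_form, beta_hat))
```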

We turn now to $\hat{Q}$; this is the key technical quantity in the paper, and
we develop a more explicit formula for it. Notice that the dummy variables
$U, V, W$ are mutually orthogonal. By the usual regression arguments,

(A5) $|f|^2 = |z|^2 - n_A (z_A)^2 - n_B (z_B)^2 - n_C (z_C)^2$,

where $|f|^2 = \sum_{i=1}^n f_i^2$. Recall (3). Check that $Y_A = a_A$, where $a_A = \sum_{i \in A} a_i / n_A$;
likewise for $B$, $C$. Hence,

(A6) $e_i = (a_i - a_A)U_i + (b_i - b_B)V_i + (c_i - c_C)W_i$,

where the residual $e_i$ was defined in (A1). Likewise,
(A7) $f_i = (z_i - z_A)U_i + (z_i - z_B)V_i + (z_i - z_C)W_i$,

where the residual $f_i$ was defined in (A2). Now

(A8) $e_i f_i = (a_i - a_A)(z_i - z_A)U_i + (b_i - b_B)(z_i - z_B)V_i + (c_i - c_C)(z_i - z_C)W_i$

and

(A9) $\sum_{i=1}^n e_i f_i = n_A[(az)_A - a_A z_A] + n_B[(bz)_B - b_B z_B] + n_C[(cz)_C - c_C z_C]$,

where, for instance, $(az)_A = \sum_{i \in A} a_i z_i / n_A$.

Recall that $p_A = n_A/n$ is the fraction of subjects assigned to treatment
$A$; likewise for $B$ and $C$. These fractions are deterministic, not random. We
can now give a more explicit formula for the $\hat{Q}$ defined in (A3), dividing
numerator and denominator by $n$. By (A5) and (A9),

(A10) $\hat{Q} = N/D$, where

$N = p_A[(az)_A - a_A z_A] + p_B[(bz)_B - b_B z_B] + p_C[(cz)_C - c_C z_C]$,

$D = 1 - p_A (z_A)^2 - p_B (z_B)^2 - p_C (z_C)^2$.

In the formula for $D$, we used (11) to replace $|z|^2/n$ by 1.

The reason $\hat{Q}$ matters is that it relates the multiple regression estimator
to the ITT estimator in a fairly simple way. Indeed, by (3) and Lemma A.2,

(A11) $\hat{\beta}_{MR} = (Y_A - \hat{Q} z_A, Y_B - \hat{Q} z_B, Y_C - \hat{Q} z_C)' = (a_A - \hat{Q} z_A, b_B - \hat{Q} z_B, c_C - \hat{Q} z_C)'$.

We must now estimate $\hat{Q}$. In view of (11), Theorem 1 shows that

(A12) $(z_A, z_B, z_C) = O(1/\sqrt{n})$.

(All $O$'s are in probability.) Consequently,

(A13) the denominator $D$ of $\hat{Q}$ in (A10) is $1 + O(1/n)$.

Two deterministic approximations to the numerator $N$ were presented in
(12)-(13).
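
The representation $\hat{Q} = N/D$ in (A10) can be compared directly with the definition (A3). The sketch below does so on made-up data, with $z$ chosen to satisfy the normalization (11), that is, mean 0 and mean square 1. The potential responses in `potential`, the assignment in `groups`, and all variable names are assumptions for illustration only.

```python
from fractions import Fraction

def mean(xs):
    xs = list(xs)
    return sum(xs, Fraction(0)) / len(xs)

# Hypothetical assignment and covariate; z has mean 0 and mean square 1, as in (11).
groups = list("AABCABBC")
z = [Fraction(v) for v in (1, 1, 1, 1, -1, -1, -1, -1)]
n = len(z)

# Hypothetical potential responses a_i, b_i, c_i for every subject.
potential = {
    "A": [Fraction(v) for v in (2, 0, 5, 1, 3, 3, 4, 6)],
    "B": [Fraction(v) for v in (1, 4, 2, 2, 0, 7, 5, 3)],
    "C": [Fraction(v) for v in (6, 1, 1, 8, 2, 0, 3, 5)],
}
# Observed response, as in (3): only the assigned treatment's response is seen.
Y = [potential[g][i] for i, g in enumerate(groups)]

N = Fraction(0)
D = Fraction(1)
e, f = [], []   # residuals (A1)-(A2), stored group by group in matching order
for g in "ABC":
    idx = [i for i, lab in enumerate(groups) if lab == g]
    p = Fraction(len(idx), n)                              # p_A = n_A / n
    r_bar = mean(potential[g][i] for i in idx)             # a_A
    z_bar = mean(z[i] for i in idx)                        # z_A
    rz_bar = mean(potential[g][i] * z[i] for i in idx)     # (az)_A
    N += p * (rz_bar - r_bar * z_bar)
    D -= p * z_bar ** 2
    e += [Y[i] - r_bar for i in idx]
    f += [z[i] - z_bar for i in idx]

q_from_A10 = N / D
q_from_A3 = sum(a * b for a, b in zip(e, f)) / sum(b * b for b in f)
print(q_from_A10 == q_from_A3)
```

If $z$ is not normalized as in (11), the leading 1 in $D$ has to be replaced by $|z|^2/n$, in line with (A5).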

Proof of Theorem 2. By Lemma A.1, we may assume $\bar{a} = \bar{b} = \bar{c} = 0$.
To see this more sharply, recall (3). Let $\hat{\beta}$ be the result of regressing $Y$ on
$U, V, W, z$. Furthermore, let

(A14) $Y_i^* = (a_i + a^*)U_i + (b_i + b^*)V_i + (c_i + c^*)W_i$.

The result of regressing $Y^*$ on $U, V, W, z$ is just $\hat{\beta} + (a^*, b^*, c^*, 0)'$. So the
general case of Theorem 2 would follow from the special case. That is why
we can, without loss of generality, assume Condition #4. Now

(A15) $(a_A, b_B, c_C) = O(1/\sqrt{n})$.

We use (A10) to evaluate (A11). The denominator of $\hat{Q}$ is essentially 1,
that is, the departure from 1 can be swept into the error term $\rho_n$, because
the departure from 1 gets multiplied by $(z_A, z_B, z_C)' = O(1/\sqrt{n})$. This is
a little delicate: we are estimating down to order $1/n^{3/2}$. The departure
of the denominator from 1 is multiplied by $N$, but terms like $a_A z_A$ are
$O(1/n)$ and immaterial, while terms like $(az)_A$ are $O(1)$ by Condition #1
and Proposition 1 (or see the discussion of Proposition A.1 below).

For the numerator of $\hat{Q}$, terms like $a_A z_A$ go into $\rho_n$: after multiplication
by $(z_A, z_B, z_C)'$, they are $O(1/n^{3/2})$. Recall that $\overline{az} = \sum_{i=1}^n a_i z_i / n$. What's
left of the numerator is $\bar{Q} + \tilde{Q}$, where

(A16) $\tilde{Q} = p_A (az - \overline{az})_A + p_B (bz - \overline{bz})_B + p_C (cz - \overline{cz})_C$.

The term $\bar{Q}(z_A, z_B, z_C)'$ goes into $\zeta_n$; see (17). The rest of $\zeta_n$ comes from
$(a_A, b_B, c_C)$ in (A11). The bias in estimating the effects is therefore

(A17) $-E[\tilde{Q}(z_A, z_B, z_C)']$.

This can be evaluated by Proposition 1, the relevant variables being $az, bz, cz, z$.
□



Additional detail for Theorem 2. We need to show, for instance,

$\hat{Q} z_A = \bar{Q} z_A + \tilde{Q} z_A + O(1/n^{3/2})$.

This can be done in three easy steps.

Step 1. $(N/D) z_A = N z_A + O(1/n^{3/2})$.
Indeed, $N = O(1)$, $D = 1 + O(1/n)$, and $z_A = O(1/\sqrt{n})$.

Step 2. $N = \bar{Q} + \tilde{Q} - R$, where $R = p_A a_A z_A + p_B b_B z_B + p_C c_C z_C$. This is
because $(\overline{az})_A = \overline{az}$, and so forth.

Step 3. $R = O(1/n)$ so $R z_A = O(1/n^{3/2})$.

Remarks. (i) As a matter of notation, $\bar{Q}$ is deterministic but $\tilde{Q}$ is random.
Both are scalar: compare (12) and (A16). The source of the bias is the
covariance between $\tilde{Q}$ and $z_A, z_B, z_C$.

(ii) Suppose we add a constant $k$ to $z$. Instead of (11), we get $\bar{z} = k$
and $\overline{z^2} = 1 + k^2$. Because $z_A$ and so forth are all shifted by the same amount
$k$, the shift does not affect $e$, $f$ or $\hat{Q}$; see (A1)-(A3). By (A4), the multiple
regression estimator for the effect of $A$ is therefore shifted by $-\hat{Q}k$; likewise
for $B$, $C$. This bias does not tend to 0 when sample size grows, but does cancel
when estimating differences in effects.
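
The effect of shifting $z$ by a constant can be seen on a small example. In the sketch below, the function name `adjusted`, the data, and the constant `k` are hypothetical. The code computes the adjusted estimates from (A1)-(A4) before and after the shift. With the sign convention of (A4), each estimate moves by $-\hat{Q}k$, while $\hat{Q}$ and the estimated differences are unchanged.

```python
from fractions import Fraction

def adjusted(Y, z, groups):
    """Return (Q_hat, {label: Y_G - Q_hat * z_G}), following (A1)-(A4)."""
    def group_mean(values, g):
        members = [v for v, lab in zip(values, groups) if lab == g]
        return sum(members, Fraction(0)) / len(members)
    y_bar = {g: group_mean(Y, g) for g in "ABC"}
    z_bar = {g: group_mean(z, g) for g in "ABC"}
    e = [y - y_bar[g] for y, g in zip(Y, groups)]
    f = [v - z_bar[g] for v, g in zip(z, groups)]
    q_hat = sum(a * b for a, b in zip(e, f)) / sum(b * b for b in f)
    return q_hat, {g: y_bar[g] - q_hat * z_bar[g] for g in "ABC"}

# Hypothetical data and shift.
groups = list("AAABBBCC")
Y = [Fraction(v) for v in (5, 1, 3, 2, 8, 4, 7, 0)]
z = [Fraction(v) for v in (1, -2, 0, 3, 1, -1, 2, -4)]
k = Fraction(5, 2)

q0, est0 = adjusted(Y, z, groups)
q1, est1 = adjusted(Y, [v + k for v in z], groups)

# Change in each estimated effect, and one estimated difference before/after.
shifts = {g: est1[g] - est0[g] for g in "ABC"}
diff_before = est0["C"] - est0["A"]
diff_after = est1["C"] - est1["A"]
print(q0 == q1, shifts, diff_before == diff_after)
```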

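(iii) In applications, we cannot assume the parameters $\bar{a}, \bar{b}, \bar{c}$ are 0: the
whole point is to estimate them. The invariance lemma, however, reduces
the general case to the more manageable special case, where $\bar{a} = \bar{b} = \bar{c} = 0$,
as in the proof of Theorem 2.

(iv) In (19), $K = O(1)$. Indeed, $\bar{z} = 0$, so $\mathrm{cov}(az, z) = \overline{(az)z} = \overline{az^2}$. Now

$\left| \frac{1}{n} \sum_{i=1}^n a_i z_i^2 \right| \le \left( \frac{1}{n} \sum_{i=1}^n |a_i|^3 \right)^{1/3} \left( \frac{1}{n} \sum_{i=1}^n |z_i|^3 \right)^{2/3}$

by Hölder's inequality applied to $a$ and $z^2$. Finally, use Condition #1. The
same argument can be used for $\mathrm{cov}(bz, z)$ and $\mathrm{cov}(cz, z)$.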

Define $\hat{Q}$ as in (A3); recall (A1)-(A2). The residuals from the multiple
regression are $e - \hat{Q}f$ by Lemma A.2; according to usual procedures,

(A18) $\hat{\sigma}^2 = |e - \hat{Q}f|^2/(n - 4)$.

Recall $f$ from (A2), and $\hat{Q}$, $Q$ from (A3) and (13).

Lemma A.3. Assume Conditions #1-#3, not Condition #4, and (11).
Then $|f|^2/n \to 1$ and $\hat{Q} \to Q$. Convergence is in probability.

Proof. The first claim follows from (A5) and (A12); the second, from 
(A10) and Theorem 1. □ 

Proof of Theorem 3. Let $M$ be the $4 \times 4$ matrix whose diagonal
is $p_A, p_B, p_C, 1$; the last row of $M$ is $(z_A, z_B, z_C, 1)$; the last column of $M$
is $(z_A, z_B, z_C, 1)'$. Pad out $M$ with 0's. Plainly, $X'X/n = M$. As before,
$p_A = n_A/n$ is deterministic, and $p_A$ converges to its limit by (9). But
$z_A = O(1/\sqrt{n})$; likewise for $B$, $C$. This proves (i).

For (ii), $e = (e - \hat{Q}f) + \hat{Q}f$. But $e - \hat{Q}f \perp f$. So $|e - \hat{Q}f|^2 = |e|^2 - \hat{Q}^2|f|^2$.
Then

$\frac{n-4}{n} \hat{\sigma}^2 = \frac{|e - \hat{Q}f|^2}{n} = \frac{|e|^2}{n} - \hat{Q}^2 \frac{|f|^2}{n}$

$= \frac{|Y|^2}{n} - p_A (Y_A)^2 - p_B (Y_B)^2 - p_C (Y_C)^2 - \hat{Q}^2 \frac{|f|^2}{n}$

$= \frac{|Y|^2}{n} - p_A (a_A)^2 - p_B (b_B)^2 - p_C (c_C)^2 - \hat{Q}^2 \frac{|f|^2}{n}$

by (A1) and (3). Using (3) again, we get

(A19) $\frac{|Y|^2}{n} = p_A (a^2)_A + p_B (b^2)_B + p_C (c^2)_C$.

(Remember, the dummy variables are orthogonal.) So

(A20) $\frac{n-4}{n} \hat{\sigma}^2 = p_A[(a^2)_A - (a_A)^2] + p_B[(b^2)_B - (b_B)^2] + p_C[(c^2)_C - (c_C)^2] - \hat{Q}^2 \frac{|f|^2}{n}$.

To evaluate $\lim \hat{\sigma}^2$, we may without loss of generality assume Condition #4,
by the invariance lemma. Now $a_A = O(1/\sqrt{n})$ and likewise for $B$, $C$
by (A15). The terms in (A20) involving $(a_A)^2$, $(b_B)^2$, $(c_C)^2$ can therefore be
dropped, being $O(1/n)$. Furthermore, $|f|^2/n \to 1$ and $\hat{Q} \to Q$ by Lemma A.3.
To complete the proof of (ii), we must show that, in probability,

(A21) $(a^2)_A \to \langle a^2 \rangle$, $(b^2)_B \to \langle b^2 \rangle$, $(c^2)_C \to \langle c^2 \rangle$.

This follows from Condition #4 and Proposition 1. Given (i) and (ii), claim
(iii) is immediate. □

Proof of Theorem 4. The asymptotic variance of the multiple regression
estimator is given by Theorem 2. The variance of the ITT estimator
$Y_C - Y_A$ can be worked out exactly, from Proposition 1 (see Example 1). A
bit of algebra will now prove Theorem 4. □

Proof of Theorem 5. By the invariance lemma, we may as well
assume that $\bar{a} = \bar{b} = \bar{c} = 0$. The ITT estimator is unbiased. By Lemma
A.2, the multiple regression estimator differs from the ITT estimator by
$\hat{Q} z_A, \hat{Q} z_B, \hat{Q} z_C$. These three random variables sum to 0 by (11) and the
balance condition. So their expectations sum to 0. Moreover, the three random
variables are exchangeable, so their expectations must be equal. To see
the exchangeability more sharply, recall (A1)-(A3). Because there are no
interactions, $Y_i = \delta_i$. So

(A27) $e = \delta - \delta_A U - \delta_B V - \delta_C W$

by (A1), and

(A28) $f = z - z_A U - z_B V - z_C W$

by (A2). These are random $n$-vectors. The joint distribution of

(A29) $e, f, \hat{Q}, z_A, z_B, z_C$

does not depend on the labels $A, B, C$: the pairs $(\delta_i, z_i)$ are just being divided
into three random groups of equal size. □

The same argument shows that the multiple regression estimator for an
effect difference (like $\bar{a} - \bar{c}$) is symmetrically distributed around the true
value.
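
The finite-sample unbiasedness in the balanced, no-interaction case can be checked by brute force on a tiny population. The sketch below enumerates all 90 ways of splitting six subjects into three groups of two and averages the adjusted estimates exactly. Everything in it is a hypothetical illustration: the values in `delta`, `z` and `effect`, and the helper names. The covariate has mean 0, as (11) requires, but is not rescaled to mean square 1; rescaling $z$ leaves $\hat{Q} z_A$ unchanged. Its values are distinct, so $|f| > 0$ for every assignment.

```python
from fractions import Fraction
from itertools import permutations

def mean(xs):
    xs = list(xs)
    return sum(xs, Fraction(0)) / len(xs)

def mr_estimate(y, z, groups):
    """Adjusted estimates Y_G - Q_hat * z_G for G in A, B, C, as in (A4)."""
    idx = {g: [i for i, lab in enumerate(groups) if lab == g] for g in "ABC"}
    y_bar = {g: mean(y[i] for i in idx[g]) for g in "ABC"}
    z_bar = {g: mean(z[i] for i in idx[g]) for g in "ABC"}
    e = [y[i] - y_bar[groups[i]] for i in range(len(y))]
    f = [z[i] - z_bar[groups[i]] for i in range(len(z))]
    q_hat = sum(a * b for a, b in zip(e, f)) / sum(b * b for b in f)
    return {g: y_bar[g] - q_hat * z_bar[g] for g in "ABC"}

# No interactions: each potential response is delta_i plus a treatment constant.
delta = [Fraction(v) for v in (4, -1, 0, 7, 2, -3)]
z = [Fraction(v) for v in (-3, -2, -1, 1, 2, 3)]      # mean zero, distinct values
effect = {"A": Fraction(1), "B": Fraction(2), "C": Fraction(4)}

# All balanced assignments: two subjects per treatment.
assignments = sorted(set(permutations("AABBCC")))

totals = {g: Fraction(0) for g in "ABC"}
for groups in assignments:
    y = [delta[i] + effect[groups[i]] for i in range(len(delta))]
    est = mr_estimate(y, z, groups)
    for g in "ABC":
        totals[g] += est[g]

average = {g: totals[g] / len(assignments) for g in "ABC"}
truth = {g: mean(delta) + effect[g] for g in "ABC"}
print(average == truth)
```

Enumeration is feasible only for very small $n$; for larger populations one would have to sample assignments, and the check would then be approximate.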

Proof of Theorem 6. By Lemma A.1, we may assume without loss of
generality that $\bar{a} = \bar{b} = \bar{c} = 0$. We can assign subjects to $A, B, C$ by randomly
permuting $\{1, 2, \ldots, n\}$: the first $n_A$ subjects go into $A$, the next $n_B$ into $B$,
and the last $n_C$ into $C$. Freeze the number of $A$'s and $B$'s (and hence $C$'s)
within each level of $z$. Consider only the corresponding permutations. Over
those permutations, $z_A$ is frozen; likewise for $B$, $C$. So the denominator of
$\hat{Q}$ is frozen: without condition (11), the denominator must be computed
from (A5). In the numerator, $z_A, z_B, z_C$ are frozen, while $a_A$ averages out to
zero over the permutations of interest; so do $b_B$ and $c_C$. With a little more
effort, one also sees that $(az)_A$ averages out to zero, as do $(bz)_B, (cz)_C$.
In consequence, $\hat{Q} z_A$ has expectation 0, and likewise for $B$, $C$. Lemma A.2
completes the argument. □






Remarks. (i) What if $|f| = 0$ in (A2)-(A3)? Then $z$ is a linear combination
of the treatment dummies $U, V, W$; the design matrix $(U\ V\ W\ z)$ is
singular, and the multiple regression estimator is ill-defined. This is not a
problem for Theorems 2 or 3, being a low-probability event. But it is a problem
for Theorems 4 and 5. The easiest course is to assume the problem away,
for instance, requiring

(A30) $z$ is linearly independent of the treatment dummies for every
permutation of $\{1, 2, \ldots, n\}$.

Another solution is more interesting: exclude the permutations where $|f| = 0$,
and show the multiple regression estimator is conditionally unbiased, that
is, has the right average over the remaining permutations.

(ii) All that is needed for Theorems 2-4 is an a priori bound on absolute
third moments in Condition #1, rather than fourth moments; third moments
are used for the CLT by Höglund (1978). The new awkwardness is in proving
results like (A21), but this can be done by familiar truncation arguments.
More explicitly, let $x_1, \ldots, x_n$ be real numbers, with

(A31) $\frac{1}{n} \sum_{i=1}^n |x_i|^\alpha < L$.

Here, $1 < \alpha < \infty$ and $0 < L < \infty$. As will be seen below, $\alpha = 3/2$ is the
relevant case. In principle, the $x$'s can be doubly subscripted, for instance,
$x_1$ can change with $n$. We draw $m$ times at random without replacement
from $\{x_1, \ldots, x_n\}$, generating random variables $X_1, \ldots, X_m$.

Proposition A.1. Under condition (A31), as $n \to \infty$, if $m/n$ converges
to a positive limit that is less than 1, then $\frac{1}{m}(X_1 + \cdots + X_m) - E(X_i)$
converges in probability to 0.



Proof. Assume without loss of generality that $E(X_i) = 0$. Let $M$ be a
positive number. Let $U_i = X_i$ when $|X_i| \le M$; else, let $U_i = 0$. Let $V_i = X_i$
when $|X_i| > M$; else, let $V_i = 0$. Thus, $U_i + V_i = X_i$. Let $\mu = E(U_i)$, so
$E(V_i) = -\mu$. Now $\frac{1}{m}(U_1 + \cdots + U_m) - \mu \to 0$. Convergence is almost sure,
and rates can be given; see, for instance, Hoeffding (1963).

Consider next $\frac{1}{m}(W_1 + \cdots + W_m)$, where $W_i = V_i + \mu$. The $W_i$ are
exchangeable. Fix $\beta$ with $1 < \beta < \alpha$. By Minkowski's inequality,

(A32) $\left[ E\left( \left| \frac{W_1 + \cdots + W_m}{m} \right|^\beta \right) \right]^{1/\beta} \le \left[ E(|W_i|^\beta) \right]^{1/\beta}$.

When $M$ is large, the right-hand side of (A32) is uniformly small, by a
standard argument starting from (A31). In essence,

$E(|X_i|^\beta ; |X_i| > M) \le M^{\beta - \alpha} E(|X_i|^\alpha ; |X_i| > M) < L/M^{\alpha - \beta}$.

□

In proving Theorem 2, we needed $(az)_A = O(1)$. If there is an a priori
bound on the absolute third moments of $a$ and $z$, then (A31) will hold for
$x_i = a_i z_i$ and $\alpha = 3/2$, by the Cauchy-Schwarz inequality. On the other hand,
a bound on the second moments would suffice, by Chebyshev's inequality.
To get (A21) from third moments, we would, for instance, set $x_i = a_i^2$; again,
$\alpha = 3/2$.